-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support CNI STATUS Verb #123
Conversation
b47ffec
to
8f10978
Compare
8a5cf10
to
63c2d24
Compare
122 merged rebase pls .. |
Will do once I get home |
6b3dfc8
to
d96cc3b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment..
5762dbb
to
37d969a
Compare
Signed-off-by: Michael Zappa <[email protected]>
37d969a
to
208eca9
Compare
@squeed over here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
containerd 2.0.1 Welcome to the v2.0.1 release of containerd! The first patch release for containerd 2.0 includes a number of bug fixes and improvements. * Fix apply IoOwner options when not in user namespace ([#11151](containerd/containerd#11151)) * Fix cri grpc plugin config migration ([#11140](containerd/containerd#11140)) * Support CNI STATUS Verb ([containerd/go-cni#123](containerd/go-cni#123)) * Update differ to handle zstd media types ([#11068](containerd/containerd#11068)) * Update runc binary to v1.2.3 ([#11142](containerd/containerd#11142)) * Fix panic due to nil dereference cgroups v2 ([#11098](containerd/containerd#11098)) Please try out the release binaries and report any issues at https://github.com/containerd/containerd/issues. * Derek McGowan * Wei Fu * Archit Kulkarni * Jin Dong * Phil Estes * Akhil Mohan * Akihiro Suda * Alexey Lunev * Austin Vazquez * Maksym Pavlenko * Mike Brown * Michael Zappa * Samuel Karp * Sebastiaan van Stijn * Andrey Smirnov * Davanum Srinivas <details><summary>50 commits</summary> <p> * Prepare release notes for v2.0.1 ([#11158](containerd/containerd#11158)) * [`b0ece5dc5`](containerd/containerd@b0ece5d) Prepare release notes for v2.0.1 * build(deps): bump actions/attest-build-provenance from 1.4.4 to 2.1.0 ([#11154](containerd/containerd#11154)) * [`fe6957084`](containerd/containerd@fe69570) build(deps): bump actions/attest-build-provenance from 1.4.4 to 2.1.0 * update xx to v1.6.1 for compatibility with alpine 3.21 and file 5.46+ ([#11153](containerd/containerd#11153)) * [`eb2ce6882`](containerd/containerd@eb2ce68) update xx to v1.6.1 for compatibility with alpine 3.21 and file 5.46+ * ctr pull should unpack for default platform when transfer service is used ([#11139](containerd/containerd#11139)) * [`44cdca68b`](containerd/containerd@44cdca6) ctr pull unpack for default platform using transfer service * Fix apply IoOwner options when not in user namespace ([#11151](containerd/containerd#11151)) * [`018d83650`](containerd/containerd@018d836) internal/cri: should not apply IoOwner options * Update go-cni for CNI STATUS ([#11146](containerd/containerd#11146)) * [`5eb7995a9`](containerd/containerd@5eb7995) feat: update go-cni version for CNI STATUS * Fix cri grpc plugin config migration ([#11140](containerd/containerd#11140)) * [`a2302ea89`](containerd/containerd@a2302ea) Add integration test for custom configuration * [`be5eda069`](containerd/containerd@be5eda0) complete cri grpc config migration * Update runc binary to v1.2.3 ([#11142](containerd/containerd#11142)) * [`a53eff53d`](containerd/containerd@a53eff5) update runc binary to v1.2.3 * Update differ to handle zstd media types ([#11068](containerd/containerd#11068)) * [`73f57acb0`](containerd/containerd@73f57ac) Update differ to handle zstd media types * update to go1.23.4 / go1.22.10 ([#11109](containerd/containerd#11109)) * [`290e8bc70`](containerd/containerd@290e8bc) update to go1.23.4 / go1.22.10 * CI: update Fedora to 41 ([#11110](containerd/containerd#11110)) * [`62b790bfa`](containerd/containerd@62b790b) CI: update Fedora to 41 * Fix panic due to nil dereference cgroups v2 ([#11098](containerd/containerd#11098)) * [`3ba2df924`](containerd/containerd@3ba2df9) fix panic due to nil dereference cgroups v2 * Publish attestation as release artifact ([#11067](containerd/containerd#11067)) * [`34a45cab2`](containerd/containerd@34a45ca) Publish attestation as release artifact * Move rockylinux 9.4 to almalinux/9 in CI ([#11053](containerd/containerd#11053)) * [`7dec6b460`](containerd/containerd@7dec6b4) move rocky 9.4 to almalinux/9 in CI * *: should align pipe's owner with init process ([#11035](containerd/containerd#11035)) * [`cf07f28ee`](containerd/containerd@cf07f28) *: should align pipe's owner with init process * fix: set the credentials even if not provided ([#11031](containerd/containerd#11031)) * [`986088866`](containerd/containerd@9860888) fix: set the credentials even if not provided * fsverity_test.go: fix nil pointer derefence, fix test fail, fix minor/major device numbers resolving ([#10978](containerd/containerd#10978)) * [`30b929ece`](containerd/containerd@30b929e) fsverity_test.go: fix major/minor device number resolving * [`10996a334`](containerd/containerd@10996a3) fsverity_test.go: fix nil pointer dereference, fix test fail * update runc binary to 1.2.2 ([#11023](containerd/containerd#11023)) * [`9081e979f`](containerd/containerd@9081e97) update runc binary to 1.2.2 * Revert "Disable vagrant strict dependency checking" ([#11009](containerd/containerd#11009)) * [`6399c936f`](containerd/containerd@6399c93) Revert "Disable vagrant strict dependency checking" * fsverity_linux.go: Fix fsverity.IsEnabled() for big endian systems ([#11005](containerd/containerd#11005)) * [`a7f2b562f`](containerd/containerd@a7f2b56) fsverity_linux.go: Fix fsverity.IsEnabled() for big endian systems * bump github.com/containerd/typeurl/v2 from 2.2.2 to 2.2.3 ([#10997](containerd/containerd#10997)) * [`389e781ea`](containerd/containerd@389e781) build(deps): bump github.com/containerd/typeurl/v2 from 2.2.2 to 2.2.3 * update to go1.23.3 / go1.22.9 ([#10973](containerd/containerd#10973)) * [`5b879f30c`](containerd/containerd@5b879f3) update to go1.23.3 / go1.22.9 * ci: enable marking 2.0 releases as latest ([#10963](containerd/containerd#10963)) * [`458215f6c`](containerd/containerd@458215f) ci: enable marking 2.0 releases as latest * Avoid arch info in the sed/replace when building cri-cni-containerd.tar.gz ([#10968](containerd/containerd#10968)) * [`e99c2b55c`](containerd/containerd@e99c2b5) Avoid arch info in the sed/replace when building cri-cni-containerd.tar.gz </p> </details> <details><summary>7 commits</summary> <p> * Support CNI STATUS Verb ([containerd/go-cni#123](containerd/go-cni#123)) * [`208eca9`](containerd/go-cni@208eca9) support CNI status verb * Bump github actions dependencies to match containerd CI repo and fix lint ([containerd/go-cni#122](containerd/go-cni#122)) * [`386f475`](containerd/go-cni@386f475) Fix ci.yml indent * [`a9b0675`](containerd/go-cni@a9b0675) Another doc commit to trigger lint? * [`14af454`](containerd/go-cni@14af454) Bump github actions dependency versions * [`9e0d096`](containerd/go-cni@9e0d096) Trivial doc commit to trigger lint </p> </details> * **github.com/containerd/go-cni** v1.1.10 -> v1.1.11 * **github.com/containerd/typeurl/v2** v2.2.2 -> v2.2.3 Previous release can be found at [v2.0.0](https://github.com/containerd/containerd/releases/tag/v2.0.0) * `containerd-<VERSION>-<OS>-<ARCH>.tar.gz`: ✅Recommended. Dynamically linked with glibc 2.31 (Ubuntu 20.04). * `containerd-static-<VERSION>-<OS>-<ARCH>.tar.gz`: Statically linked. Expected to be used on non-glibc Linux distributions. Not position-independent. In addition to containerd, typically you will have to install [runc](https://github.com/opencontainers/runc/releases) and [CNI plugins](https://github.com/containernetworking/plugins/releases) from their official sites too. See also the [Getting Started](https://github.com/containerd/containerd/blob/main/docs/getting-started.md) documentation.
Hey @mikebrow (cc @MikeZappa87), this seems to cause issues with dual-CNI setups. I am using both Cilium + Multus, and during a node reboot; the node reports Ready and then shortly after goes NotReady. I was not seeing this issue with Kubelet just spams these logs when it happens:
|
Thanks for the reply @MikeZappa87. Here are the files inside /etc/cni/net.d/00-multus.conf
/etc/cni/net.d/05-cilium.conflist
|
The CNI status would return nil for these. I wonder if something else is going on? Are you able to get the containerd logs and perhaps the kubectl describe node output? |
@MikeZappa87 I have some log files, but not sure if they are helpful. |
The level would need to be increased on the runtime to actually determine what is failing |
@buroa can you give the output of kubectl describe node on one of the impacted nodes? |
@MikeZappa87 Doesn't show anything that isn't known already:
|
I am looking for 'Network plugin returns error' since I don't think the issue you are seeing is related to the change. "NetworkUnavailable False Fri, 06 Dec 2024 16:32:35 -0600 Fri, 06 Dec 2024 16:32:35 -0600 CiliumIsUp Cilium is running on this node" |
If it was CNI Status we would see something like: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized |
@MikeZappa87 I set the containerd to debug mode and captured some more things. Here are the logs: |
Is rook not in a healthy state? This might be the problem. I am not familiar with this but right now the errors aren't indicating anything related to the CNI. @mikebrow keep me honest |
@MikeZappa87 It's as healthy as it can. Pods drain when I reboot the node and then they are rescheduled. Rook pods are quite quick to come up, so that are kinda spammed in the logs. I have another buddy that runs the same stack and is having no problems, but they also do not run multus. Once I remove multus, my cluster is as healthy as can be. Again, had no problems with |
To clarify, this works with containerd 2.0.1 with just cilium and when multus is added it breaks? |
@MikeZappa87 Yep. |
I thought multus rewrote the cni conf file? The provided doesn’t show that. In the cilium config map what is the value of cni-exclusive this might be more of a cilium issue |
The config.toml is probably messed up for your 2.0.1 containerd.. I'm not seeing the CRI plugin start.. It should look more like this:
|
@mikebrow It is controlled via Talos. /etc/cri/containerd.toml:
/etc/cri/conf.d/cri.toml:
cri logs:
|
well now that you have containerd up you should be able to get kubelet to reconnect |
@mikebrow Once containerd gets in the bugged state, kubelet can never reconnect to it. It spams |
siderolabs/talos#9496 related perhaps |
Just to note: This is only happening when I reboot + have multus enabled. If I disable multus and reboot, the cluster comes up fine. Once the cluster settles... I can enable multus and nothing breaks. So it's a race somewhere on boot with cilium/multus/containerd. |
is possible you have multiple problems.. as the sandbox store is also reporting errors in your cri log... note: containerd/containerd#10848 (comment) |
They happen on every Kubernetes cluster I have ever touched. Just look scary. Any time you reboot a node, it takes a minute or two to settle things. |
we have a config field in cni setup .. setup_serially set that to true ... |
^ this guy |
@mikebrow On it, moment. Update: still broke :( |
anyhow that would eliminate if there is a timing conflict setting them up in parallel |
If adding multus is causing the issue I would try to confirm that it’s a multus issue. One way you can do this is run cri-o, cilium and Multus. And if it doesn’t work I would try and use cri-o and cilium. If that works it’s a Multus issue. If that’s the case I would reach out to them as well |
We are implementing the CNI Status verb. The Status verb is to provide the container runtime the ability to determine if the runtime should call CNI ADD.
Please merge #122 first
Reference:
ocicni (cri-o/ocicni#196)